NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Contextual Document Embeddings

Morris, John X; Rush, Alexander M (April 2025, ICLR)

Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
more » « less
Full Text Available
Compute-Constrained Data Selection

Yin, Junjie Oscar; Rush, Alexander M (April 2025, ICLR)

Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the setting in which both the cost of selecting data and training are budgeted for. We first formalize the problem of data selection with a cost-aware utility function, and model the data selection problem as trading off initial-selection cost for training gain. We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute. Interestingly we find that many powerful data selection methods are almost never compute-optimal, and that cheaper data selection alternatives dominate both from a theoretical and empirical perspective. For compute-optimal training, we find that perplexity and gradient data selection require training-to-selection model size ratios of 5x and 10x, respectively
more » « less
Full Text Available
Great Memory, Shallow Reasoning: Limits of kNN-LMs

Geng, Shangyi; Zhao, Wenting; Rush, Alexander M (April 2025, NAACL)

Full Text Available
Challenges in Trustworthy Human Evaluation of Chatbots

Zhao, Wenting; Rush, Alexander M; Goyal, Tanya (April 2025, NAACL)

Full Text Available
Workshop Report: Open-Source Generative AI

Rush, Alexander M (December 2024, https://srush.github.io/osgai/index.html)

Full Text Available
COMMIT0: LIBRARY GENERATION FROM SCRATCH

Zhao, Wenting; Jiang, Nan; Lee, Celine; Chiu, Justin T; Cardie, Claire; Galle, Matthias; Rush, Alexander M (April 2025, ICLR)

Full Text Available
Triton-Viz: Visualizing GPU Programming in AI Courses

https://doi.org/10.1145/3641554.3701795

Ramesh, Tejas; Rush, Alexander; Liu, Xu; Yin, Binqian; Zhou, Keren; Jiao, Shuyin (February 2025, ACM)

Full Text Available
Simple Guidance Mechanisms for Discrete Diffusion Models

Schiff, Yair; Sahoo, Subham; Phung, Hao; Wang, Guanghan; Boshar, Sam; Dalla-torre, Hugo; de_Almeida, Bernardo; Rush, Alexander; Pierrot, Thomas; Kuleshov, Volodymyr (April 2025, ICLR)

Full Text Available
I Could’ve Asked That: Reformulating Unanswerable Questions

Zhao, Wenting; Gao, Ge; Cardie, Claire; Rush, Alexander M (November 2024, EMNLP)

When seeking information from unfamiliar documents, users frequently pose questions that cannot be answered by the documents. While existing large language models (LLMs) identify these unanswerable questions, they do not assist users in reformulating their questions, thereby reducing their overall utility. We curate CouldAsk, an evaluation benchmark composed of existing and new datasets for document-grounded question answering, specifically designed to study reformulating unanswerable questions. We evaluate state-of-the-art open-source and proprietary LLMs on CouldAsk. The results demonstrate the limited capabilities of these models in reformulating questions. Specifically, GPT-4 and Llama2-7B successfully reformulate questions only 26% and 12% of the time, respectively. Error analysis shows that 62% of the unsuccessful reformulations stem from the models merely rephrasing the questions or even generating identical questions. We publicly release the benchmark and the code to reproduce the experiments.
more » « less
Full Text Available
The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Wang, Junxiong; Paliotta, Daniele; May, Avner; Rush, Alexander M; Dao, Tri (December 2024, NeurIPS)

Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall we show how, with limited computation resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best 8B scale instruction-tuned linear RNN model. We also find that the distilled model has natural length extrapolation, showing almost perfect accuracy in the needle-in-a-haystack test at 20x the distillation length.
more » « less
Full Text Available

« Prev Next »

Search for: All records